Lyapin Artur Mansurovich, Postgraduate student, Penza State University (40 Krasnaya Street, Penza, Russia)
Background. The object of the study is short messages from public sources such as social networks, forums, and public SMS messages, with geodata about the status of the user at the time the message was published. The subject of the study is the classification of short messages using data mining methods, together with a comparative analysis of the "nearest neighbors" and "naive Bayes" methods. The aim of the work is to develop a data mining methodology that classifies messages without prior training of the system, and to verify it experimentally on a collection of data from social networks, with a view to detecting external interference in the operation of the road and transport infrastructure.
Materials and methods. The processing of short text messages from public sources was studied using data mining techniques, for the purpose of classifying road incidents. Data sets for the experimental system were taken from thematic forums, social network groups, and news sites.
Results. A methodology based on an ensemble of data mining methods is proposed; it classifies short text messages without preliminary training of the system. A computer program was developed on the basis of the proposed methodology; it classifies data from public sources and displays the received messages, with their attached geodata, on a map of Penza.
Conclusions. A comparative analysis of the two data processing methods showed that the "nearest neighbors" method achieves greater accuracy on the test data set than the "naive Bayes" method. This also supports the assertion that machine learning methods can be applied successfully to short text messages of different kinds and in different fields. In addition, information received from social networks and SMS messages proved valuable for determining the reaction of road users in real time.
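The comparison described in the conclusions can be illustrated with a minimal sketch. It is not the authors' program: the toy messages, the labels "incident" and "other", the whitespace tokenizer, the cosine similarity measure, and the value k = 3 are all assumptions chosen for illustration, and the sketch says nothing about which method is more accurate on real data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data (message, label); not taken from the study.
TRAIN = [
    ("accident on the bridge two cars collided", "incident"),
    ("traffic jam after crash near the station", "incident"),
    ("car crash on main street road blocked", "incident"),
    ("collision at the crossroads traffic stopped", "incident"),
    ("selling a used bike in good condition", "other"),
    ("great concert in the park last night", "other"),
    ("looking for a flat near the station", "other"),
    ("happy birthday to my best friend", "other"),
]
TEST = [
    ("crash on the bridge traffic blocked", "incident"),
    ("selling my bike", "other"),
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(samples):
    """Count messages per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_naive_bayes(model, text):
    """Multinomial naive Bayes with add-one smoothing; unseen words are skipped."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_knn(samples, text, k=3):
    """Majority vote among the k training messages most similar to the query."""
    query = Counter(tokenize(text))
    ranked = sorted(
        samples,
        key=lambda s: cosine(query, Counter(tokenize(s[0]))),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(predict, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(text) == label for text, label in samples) / len(samples)


nb_model = train_naive_bayes(TRAIN)
nb_accuracy = accuracy(lambda t: predict_naive_bayes(nb_model, t), TEST)
knn_accuracy = accuracy(lambda t: predict_knn(TRAIN, t), TEST)
print("naive Bayes:", nb_accuracy, "nearest neighbors:", knn_accuracy)
```

On real data, both classifiers would be scored on the same held-out set of labeled messages, which is how an accuracy comparison of this kind is usually made.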